Association of data to a biological sequence

ABSTRACT

A computer assembly includes a processor configured to access data on a network and to perform a method. The method includes identifying, in the network, one or more references having a relevance level greater than a predetermined threshold. The one or more references are associated to one or more probe sequences corresponding to one or more biological sequences. The one or more probe sequences are ranked based on one or more criteria corresponding to a target biological sequence. The one or more probe sequences are assigned with a level of affinity to one or more segments of the target biological sequence based at least on the ranking of each of the one or more probe sequences.

BACKGROUND

The present disclosure relates to a simulated binding of data to a biological sequence, and in particular to identifying data that is relevant to a biological sequence, ranking the data according to its importance, and providing the data to a user according to the ranking.

Analysis of biological data, including biological sequences, may require large amounts of data stored on different computers to perform the analysis. Biological data being researched may be annotated by a program to refer to data, such as research publications, related to the biological data. This allows a researcher to see other data that is related to the present research of the biological data. While research articles and other publications are useful for analyzing biological data, other resources are also useful, such as analysis tools and software programs. In addition, over time, the amount of information regarding biological data being researched grows. When biological data is annotated to refer to related publications, the annotations also increase, which may make it more difficult for a researcher to identify the important information related to present research.

SUMMARY

Exemplary embodiments include a computer assembly for associating data with a biological sequence. The computer assembly includes a processor configured to access data on a network and to perform a method. The method includes identifying, in the network, one or more references having a relevance level greater than a predetermined threshold. Each reference of the one or more references is associated with a probe sequence corresponding to a segment of a biological sequence. The method includes ranking one or more probe sequences based on one or more criteria and assigning the one or more probe sequences with a level of affinity to a segment of a target biological sequence based at least on the ranking of each probe sequence.

Embodiments further include a system for simulating annealing to a biological sequence. The system includes one or more network computers having stored therein data and a host computer. The host computer has stored therein a biological sequence. The host computer is connected to the one or more network computers via a communications network. The host computer is configured to identify data in the one or more network computers as relevant data that is relevant to the biological sequence and to associate the relevant data with a segment of the biological sequence. The host computer is further configured to rank the relevant data based on predetermined criteria to determine a level of affinity of the relevant data with the segment of the biological sequence.

Embodiments further include a computer program product for simulating annealing to a biological sequence. The computer program product includes a processor and a non-transitory computer readable medium having stored thereon code to perform a method. The method includes identifying, by the processor, references to data in a network as relevant references that are relevant to a biological sequence and associating the relevant references with a segment of the biological sequence. The method includes ranking the relevant references based on predetermined criteria to determine a level of affinity of the relevant references with the segment of the biological sequence.

Additional features and advantages are realized by implementation of embodiments of the present disclosure. Other embodiments and aspects of the present disclosure are described in detail herein and are considered a part of the claimed invention. For a better understanding of the embodiments, including advantages and other features, refer to the description and to the drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The subject matter which is regarded as embodiments of the present disclosure is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features, and advantages of the embodiments are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 illustrates a network system according to embodiments of the present disclosure;

FIG. 2 illustrates a simulated annealing module according to embodiments of the disclosure;

FIG. 3 illustrates a user customization display according to embodiments of the disclosure;

FIG. 4A illustrates an annealing display according to an embodiment of the disclosure;

FIG. 4B illustrates an annealing display according to an embodiment of the disclosure;

FIG. 5 illustrates an annealing display according to another embodiment of the disclosure;

FIG. 6 illustrates a table according to embodiments of the disclosure;

FIG. 7 illustrates a flowchart of a method according to embodiments of the disclosure;

FIG. 8 illustrates a computer system according to embodiments of the disclosure; and

FIG. 9 illustrates a computer program product according to embodiments of the disclosure.

DETAILED DESCRIPTION

The large volume of data that may be annotated to a biological sequence may make it difficult for a researcher to identify important data. Embodiments of the present disclosure relate to displaying a simulated annealing of data and references to a biological sequence to allow researchers to quickly identify important information.

FIG. 1 illustrates a network system 100 according to an embodiment of the present disclosure. The system 100 includes a host computer 110 including a simulated annealing module 111, a biological sequence 112 (also referred to as a target biological sequence 112) and data 113. The simulated annealing module 111 is configured to analyze data and references to the data, to determine which data is relevant data 114 to the biological sequence 112, and to determine a level of affinity of the relevant data 114 to the biological sequence 112 based on predetermined ranking criteria. The host computer 110 may display the determined level of affinity by displaying the biological sequence 112, displaying relevant data 114, symbols representing the relevant data 114, or symbols representing references to the relevant data 114, and adjusting a distance of the relevant data 114 (or corresponding symbols) from the biological sequence 112 based on the level of affinity of the relevant data 114 with respect to the biological sequence 112.

The host computer 110 may be connected to a network 120. The network 120 may communicate with one or more network computers 130, which in the present specification and claims refer to computers connected to the network 120 to communicate via the network 120. The network computers 130 may include data 131, such as documents 132, analysis tools 133 and biographical data 134. While only a few types of data are illustrated for purposes of description, any type of data may be stored by the network computers 130. The simulated annealing module 111 may access the data 131 to determine which data 131 is relevant to the biological sequence 112. The simulated annealing module 111 may rank the relevant data 131 based on the predetermined ranking criteria to determine a level of affinity of the data 131 to the biological sequence 112.

The host computer 110 may also be connected to one or more storage devices 140, and the storage devices may store one or more references 141 pointing to data 142, and one or more biological sequences 143 that may be target biological sequences, or biological sequences against which data is compared to determine a level of affinity. For example, the storage 140 may contain a database of biological sequences 143 and a user of the host computer 110 may upload a biological sequence 143 from the storage 140 to the host computer 110 to allow the simulated annealing module 111 to perform an analysis of data, such as data 113, data 131 and data 142 with respect to the biological sequence 143, to determine the level of affinity of the data to the biological sequence 143.

In addition, the network 120 may access one or more of storage 170 and network computers 180 via a server 150 connected to the Internet 160. Alternatively, the host computer 110 may directly connect to the Internet 160. The storage 170 and network computer 180 may include data, references and biological sequences accessible by the host computer 110 to perform analysis.

In embodiments of the present disclosure, the biological sequence 112 or 143 may include any type of biological sequence, including deoxyribonucleic acid (DNA), ribonucleic acid (RNA), an amino acid sequence of a protein, or any other biological sequence. Data includes documents, files, stored biographical information of a person or information of an organization, stored publications, data regarding a number of queries of the simulated annealing module 111 or other systems, data regarding previous analysis performed on the biological sequence 112, analysis tools, algorithms or programs, medical treatments associated with the biological sequence 112, data regarding comments or reviews of publications or tools, or any other data. References include any pointer or address that indicates a location of data or provides additional information regarding the data. Examples include uniform resource locators (URLs), uniform resource names (URNs), hyperlinks, JavaScript pointers to data, XML pointers to data or any other type of reference to data.

An operation of the host computer 110 including the simulated annealing module 111 will be described below with reference to FIGS. 1 and 2. The simulated annealing module 111 may include a biological sequence identifier 206 to identify a target biological sequence 112. For example, a user accessing the host computer 110 may display the biological sequence 112 or data corresponding to the biological sequence 112 on a display device. Alternatively, the simulated annealing module 111 may automatically, or based on predetermined commands to identify a predetermined biological sequence or a predetermined class or group of biological sequences, search one or more of the host computer 110, storage 140 and 170, and network computers 130 and 180 to identify biological sequences to be target biological sequences 112. In the present specification and claims, a “target biological sequence” is defined as a biological sequence that is selected by a user or program to be subject to ranking of related data, and in some embodiments simulated annealing, as described in embodiments of the disclosure.

The simulated annealing module 111 may include a reference identifier 201, a reference generator 202, a relevance identifier 203 and a reference/data associator 204. The reference identifier 201 may search memory of a device, such as the host computer 110, of connected storage devices 140, of devices 130 connected to a network 120, or of devices 170 and 180 connected to the Internet 160 for references to data, such as URLs that refer to data at a particular location. In addition, in circumstances in which data does not correspond to a reference, or the reference is not in a format usable by the simulated annealing module 111, the reference generator 202 may search memory of a device, such as the host computer 110, of connected storage devices 140, of devices 130 connected to a network 120, or of devices 170 and 180 connected to the Internet 160 for data, such as documents, biographical data, data related to analysis tools, and any other data. The reference generator 202 may then generate a reference, such as a URL that points to a location of the data.
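
As a minimal sketch of how the reference identifier 201 and the reference generator 202 might cooperate, the following Python fragment keeps an existing reference when it is in a usable format and otherwise generates a URL. The names DataItem, ensure_reference and USABLE_SCHEMES, the base URL, and the rule that only http and https references count as usable are assumptions made for illustration and are not specified in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote

# Assumption: a reference counts as "usable" only if it is an http(s) URL.
USABLE_SCHEMES = ("http://", "https://")


@dataclass
class DataItem:
    """Hypothetical record for a piece of data found during a search."""
    item_id: str
    reference: Optional[str] = None


def ensure_reference(item: DataItem, base_url: str) -> str:
    """Return the item's reference, generating a URL when none is usable."""
    if item.reference and item.reference.startswith(USABLE_SCHEMES):
        return item.reference  # reference identifier: keep what was found
    # reference generator: build a URL pointing at the data's location
    return f"{base_url.rstrip('/')}/{quote(item.item_id)}"
```

In this sketch, data with no reference, or with a reference in an unsupported format, receives a generated URL under the hypothetical base address.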

The relevance identifier 203 analyzes the data, such as the data pointed to by the searched references or the data identified by the reference generator 202, to determine whether the data meets a threshold level of relevance. The threshold level of relevance may be based on predetermined criteria, such as a similarity of the data to a target biological sequence, a source of the data, such as an organization supplying the data (e.g. university, company, etc.), an author of the data, a publisher of the data, and a type of operation performed by execution of the data (such as in the case of an analysis tool for analyzing biological sequences). The threshold level of relevance may also be based on a frequency with which the data is accessed or referenced, a frequency with which the data is accessed or referenced by predetermined classes, such as researchers, scientists, professional organizations, etc., or a frequency with which the data is associated with a target biological sequence. In other words, the threshold level of relevance may be related to a target sequence or may include criteria unrelated to the target sequence. The threshold level of relevance may be based on the content of the data, such as an identity of a person or organization that is the subject of biographical information data, or content of a document or file. In addition, the threshold level of relevance may be based on usage of the data, such as how often the data is accessed or referenced or by whom the data is accessed or referenced.
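
The threshold test performed by the relevance identifier 203 can be illustrated with a short Python sketch. The disclosure does not give a formula, so the score below, its fixed weights (0.5, 0.3, 0.2), the access cap, and the names Candidate, relevance_score and is_relevant are all hypothetical. The sketch combines a target-related criterion (similarity) with criteria unrelated to the target (source and access frequency).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of a piece of data under evaluation."""
    similarity: float   # 0..1 similarity to the target sequence (assumed precomputed)
    source: str         # organization supplying the data
    access_count: int   # how often the data has been accessed or referenced


def relevance_score(item, trusted_sources, access_cap=100):
    """Combine target-related and target-unrelated criteria into one score."""
    score = 0.5 * item.similarity
    score += 0.3 * (1.0 if item.source in trusted_sources else 0.0)
    score += 0.2 * min(item.access_count, access_cap) / access_cap
    return score


def is_relevant(item, trusted_sources, threshold=0.5):
    """True when the score is greater than the predetermined threshold."""
    return relevance_score(item, trusted_sources) > threshold
```

Only data for which is_relevant returns True would proceed to the association step described next.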

Based on a determination that the data meets a threshold level of relevance, the reference/data associator 204 associates a reference, either identified or generated, with the data. For example, the reference and data, or information identifying the data, may be stored in a reference table 205. The probe generator 207 may generate a probe or probe sequence and the probe/reference associator 208 may associate the probe sequence with the reference, such as by adding the probe sequence to the reference table 205. In one embodiment, the probe represents a degree to which the reference, or the data associated with the reference, corresponds to a particular segment of a biological sequence to which the data pertains. The segment of the biological sequence to which the data pertains may be a portion less than the entire biological sequence, but in some examples the segment could correspond to the entire biological sequence. In one embodiment, the probe is identified by a sequence that is complementary to a sequence of a segment of the biological sequence to which the data pertains. For example, if a segment of the biological sequence to which the data pertains has a configuration of “GGGGAAAATT,” the probe may correspond to a complementary probe sequence, or “CCCCTTTTAA.” Accordingly, the host computer 110 may match references and data to portions of biological sequences pertaining to each reference according to a sequence indicated by a probe sequence. In other embodiments, the probe identifies spatially, numerically or graphically a portion of the biological sequence to which it pertains.
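
The complementary probe in the example above can be reproduced with a few lines of Python. This is a minimal sketch assuming a DNA segment and a position-by-position complement, as in the “GGGGAAAATT” example (the probe is not reversed); the function name complementary_probe is hypothetical.

```python
# Base-pairing table for DNA (A pairs with T, G pairs with C).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_probe(segment: str) -> str:
    """Return the probe sequence complementary to a DNA segment."""
    try:
        return "".join(COMPLEMENT[base] for base in segment.upper())
    except KeyError as exc:
        raise ValueError(f"not a DNA base: {exc.args[0]!r}") from None


print(complementary_probe("GGGGAAAATT"))  # prints CCCCTTTTAA
```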

The rank calculator 209 may calculate a rank of each probe sequence, or each reference, or of the data corresponding to each reference and probe sequence. The ranking may be based on one or more criteria, and the criteria may be weighted so that different criteria affect the ranking more than other criteria. For example, a user may set up a profile in the host computer 110 or the simulated annealing module 111, and the user may indicate which indicia the user would prefer to be given the most weight. In one embodiment, the weighting criteria are analogous to biological characteristics of a probe and biological sequence in a biological annealing process.

One criterion, which may be analogous to a biological complementarity requirement, is a determination of a similarity, resemblance, or overlap of the data or the probe with a segment of the target biological sequence. For example, a document may explicitly describe the segment of the target biological sequence or an analysis tool may have been used to analyze the segment of the target biological sequence. Alternatively, a document may describe a similar, but not identical, biological sequence, which may result in a lower ranking.

Another criterion, which may be analogous to a binding affinity of a probe in a biological annealing, is a determination of importance or prestige of a probe, data or reference. The importance of the data may be determined based on information about the data that does not necessarily relate to the target biological information. For example, the importance of the data may be based on one or more of a number of citations the authors of a document have received, the identities of people or organizations that have referenced a document, a university or organization where the data was generated, such as where research was conducted, a type of analysis performed in a document, a name of an author of a document, or any other factors that may provide information regarding the prestige or importance of data in a field. When the data relates to an analysis tool, for example, the importance of the data may be determined based on a type of analysis performed by the tool, a university or organization where the analysis tool was developed, a creator of the analysis tool, or any other factors that may provide information regarding the prestige or importance of the analysis tool.

Another criterion, which may be analogous to probe mobility in a biological annealing, is a determination of a popularity of data. The determination may take into account a frequency with which the data is accessed or cited, such as a frequency with which a document is cited in other publications or a frequency with which an analysis tool is used in a field. In an embodiment in which the data is biographical information, such as information about a researcher, the determination may consider a frequency with which the researcher is cited. In other words, while determining the importance or prestige of the data relates to factors that are not directly associated with the data (such as a prestige of the source of the data), the popularity of the data may relate to the data itself.

Another criterion, which may be analogous to occupancy constraints in a biological annealing, is a determination of a historical applicability of data to a target biological sequence, to the probe or to other probe sequences. For example, the determination may be based on a frequency with which an analysis tool has been used to analyze a target segment of the biological sequence, or a frequency with which a document has been cited in reference to the target segment of the biological sequence.

Although a few examples of ranking criteria have been provided, embodiments of the present disclosure encompass any ranking criteria including prior uses of a tool, prior citations of a document, a quality of citations to the data, affiliations of an author of data, ease-of-use of a tool, cost to implement an analysis tool, number of software or tool citations, a date of citations or references to the data, votes or other determinations of importance of data by crowd sourcing, content of user comments about the data, etc. In addition, in one embodiment a stochastic element may be introduced to the rankings to allow for optimization away from local minima.
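
The disclosure does not say how the stochastic element is introduced. One possible reading, sketched below in Python, adds zero-mean Gaussian noise to each score, scaled by a "temperature" parameter, before sorting. The function name perturbed_ranking, the temperature parameter and the choice of Gaussian noise are all assumptions for illustration.

```python
import random


def perturbed_ranking(scores, temperature, seed=None):
    """Rank names by score after adding temperature-scaled Gaussian noise.

    scores: mapping of name -> deterministic score.
    temperature: standard deviation of the noise; 0 disables the noise.
    seed: optional seed so that a perturbed ranking can be reproduced.
    """
    rng = random.Random(seed)
    noisy = {}
    for name, score in scores.items():
        noise = rng.gauss(0.0, temperature) if temperature > 0 else 0.0
        noisy[name] = score + noise
    # Highest (perturbed) score first.
    return sorted(noisy, key=noisy.get, reverse=True)
```

With a temperature of zero the ordering is the deterministic ranking; a larger temperature lets a lower-scored item occasionally move ahead of a higher-scored one.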

In one embodiment, the probe sequence includes or has attached to it the data associated with the ranking criteria. When a target biological sequence is identified, it may be compared with all of the available probe sequences, and the available probe sequences may be ranked based on a similarity with segments of the target biological sequence. Accordingly, in one embodiment, the data and references identified in the system are not directly compared with data associated with the target biological sequence. Instead, a probe sequence having stored therein, attached thereto, or which inherently includes the ranking criteria data is compared to segments of the target biological sequence.
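
Comparing a probe sequence with segments of a target sequence can be sketched as a sliding-window search. The Python fragment below is an illustration only: it assumes DNA sequences, scores each window by the fraction of positions at which the target base pairs with the probe base, and returns the best window. The names complementarity and best_annealing_site are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementarity(segment: str, probe: str) -> float:
    """Fraction of positions where the segment base pairs with the probe base."""
    matches = sum(1 for s, p in zip(segment, probe) if COMPLEMENT.get(s) == p)
    return matches / len(probe)


def best_annealing_site(target: str, probe: str):
    """Return (start index, complementarity) of the best-matching target segment."""
    if not probe or len(probe) > len(target):
        raise ValueError("probe must be non-empty and no longer than the target")
    width = len(probe)
    starts = range(len(target) - width + 1)
    # max() keeps the first window in case of ties.
    best = max(starts, key=lambda i: complementarity(target[i:i + width], probe))
    return best, complementarity(target[best:best + width], probe)


# The probe from the earlier example anneals fully at position 2 of this target.
print(best_annealing_site("ATGGGGAAAATTCG", "CCCCTTTTAA"))  # prints (2, 1.0)
```

A probe describing a similar but not identical sequence would yield a fraction below 1.0, which could feed the similarity criterion of the ranking.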

The probe sequences may be generated by systems or users that generated the data and/or references, or the probe sequences may be generated by a system or user that identifies a target biological sequence. For example, in one embodiment, a system that identifies a target biological sequence searches a network for previously generated probe sequences and performs the ranking operation. In another embodiment, the system that identifies the target biological sequence may search the network, identify relevant data, and generate the probe sequences corresponding to the relevant data. Then the data, references or generated probe sequences may be compared to predetermined criteria and to the target biological sequence. In yet another embodiment, a system may combine both analysis of pre-generated probe sequences with the generation of new probe sequences based on newly-identified data or references in a network.

Once the data, references or probe sequences have been ranked by the rank calculator 209, the annealing display generator 210 generates a graphical display of the ranking. The graphical display may display an icon or other representation of the data, reference, or probe and of the target biological sequence, or one or more target segments of the biological sequence. The annealing display generator may display the ranking by displaying icons associated with data having a higher ranking as being located closer to a corresponding segment of the target biological sequence, while icons associated with data having a lower ranking are located farther from the segment of the target biological sequence.

In embodiments of the present disclosure, the data may be analyzed by the simulated annealing module 111 to determine relevant data and to rank the data by analyzing one or more of key words in the data, frequency of keywords, groups of key words, frequency of groups of keywords, metadata associated with the data, or any other content of the data or content related to the data.

According to embodiments of the present disclosure, data, references and probe sequences may be competitively ranked to ensure that data, references and probe sequences determined to be of highest interest to a user are more closely associated with a target biological sequence being analyzed by the user.

In one embodiment, in addition to annealing one or more probe sequences to a target biological sequence, the simulated annealing module 111 may anneal one or more probe sequences to one or more other probe sequences. For example, if one probe sequence represents an analysis tool, another probe sequence representing a program or tool for improving the efficiency of the analysis tool may be simulated as being annealed to the first probe sequence. In another example, if a first probe sequence represents a software application, one or more probe sequences representing journal citations including formulas or analysis using the software application may be simulated as being annealed to the first probe sequence.

In embodiments of the present disclosure, a user may determine settings, or may generate a profile, to adjust or alter the ranking and display of data, references, and probe sequences. FIG. 3 illustrates a display 300 or graphical user interface (GUI) 300 which may be displayed on an electronic display device, such as a computer monitor, to allow a user to set preferred weights. The display 300 includes rank criteria 301 a, 301 b, 301 c and 301 d. In FIG. 3, the rank criteria include “similarity to target sequence” 301 a, “importance of reference” 301 b, “popularity of reference” 301 c, and “historical applicability to target sequence” 301 d. However, embodiments of the present disclosure encompass any criteria, including pre-set criteria or criteria generated by a user.

FIG. 3 further illustrates sub-ranking icons 302 a to 302 d, which may allow a user to further specify ranking preferences. For example, under the sub-ranking icon 302 b, a user may specify that in determining an importance of a reference, the organization with which an author or creator is affiliated is more important, and receives a higher weight, than a total number of citations to the author. A user may also set minimum standards, such as a minimum number of citations to data or a reference that is required for the data or reference to obtain any ranking.

The display 300 may further include fields 303 a to 303 d that are able to be modified by a user to adjust the weighting desired by a user. In one embodiment, the rank calculator 209 utilizes an algorithm to combine a user's selected weight of one or more criteria in combination with information contained within the relevant data or references, or metadata associated with the data or references, to calculate a finalranking of the data, reference, or probe sequence.
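
The algorithm used by the rank calculator 209 is not specified. One simple possibility, sketched below in Python, is a normalized weighted sum over the four criteria of FIG. 3, with a user-set minimum number of citations applied as a filter. The dictionary layout, the key names, and the functions final_score and rank_probes are assumptions for illustration.

```python
def final_score(criteria, weights):
    """Weighted average of per-criterion scores (each assumed to lie in 0..1)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * criteria.get(name, 0.0) for name, w in weights.items()) / total


def rank_probes(probes, weights, min_citations=0):
    """Drop probes below the citation minimum, then sort by final score."""
    eligible = [p for p in probes if p["citations"] >= min_citations]
    return sorted(eligible,
                  key=lambda p: final_score(p["criteria"], weights),
                  reverse=True)


probes = [
    {"name": "A", "citations": 10,
     "criteria": {"similarity": 0.9, "importance": 0.2,
                  "popularity": 0.1, "historical": 0.3}},
    {"name": "B", "citations": 50,
     "criteria": {"similarity": 0.3, "importance": 0.5,
                  "popularity": 0.9, "historical": 0.4}},
]
# A user who weights similarity most heavily sees probe A ranked first.
weights = {"similarity": 4, "importance": 1, "popularity": 1, "historical": 1}
print([p["name"] for p in rank_probes(probes, weights)])  # prints ['A', 'B']
```

Changing the weights to favor popularity would reverse the order of the two probes, which corresponds to different users seeing different arrangements for the same target sequence.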

FIGS. 4A and 4B illustrate displays 400 a and 400 b of icons 402, 403 and 404 representing data, references or probe sequences annealing to target biological sequence 401 based on a ranking of the data, references or probe sequences. Alternatively, in addition to annealing to the target biological sequence 401, one or more of the icons 402, 403 and 404 may be annealed to other icons among 402, 403 and 404, representing the annealing of a probe sequence to another probe sequence that is annealed to the target biological sequence 401. Referring to FIG. 4A, icons 402, 403 and 404 having different visual characteristics, such as different cross-hatching, different colors, different shapes or different graphic representations may correspond to different types of data or references. For example, an icon 402 having a first type of cross-hatching may correspond to a document, an icon 403 having a second type of cross-hatching may correspond to an application and an icon 404 having a third type of cross-hatching may correspond to biographical information about a person, such as a researcher. Other types of data that may be represented by the icons 402, 403 and 404 include data about an organization, such as a company or university, computer program information, project information about a research project, information about web pages that may contain relevant information, etc.

The icons are displayed to be vertically (in FIGS. 4A and 4B) aligned with a segment of the target biological sequence 401 according to the probe sequence associated with the data and reference represented by the icons 402, 403 and 404. In addition, the icons are displayed as being a distance away from a segment of the target biological sequence 401 (in a horizontal direction in FIGS. 4A and 4B) based on a ranking of the data or reference associated with the icons. For example, the icons labeled 402 and 404 may be associated with the same or a very similar segment of the biological sequence, as indicated by the close vertical alignment of icons 402 and 404. However, icon 404 may be associated with data having a higher ranking than icon 402, as illustrated by icon 404 being located closer to the target biological sequence than icon 402 in the horizontal direction.
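
The placement rule described above can be sketched in Python as a mapping from a segment position and a ranking score to display coordinates. The linear mapping, the score range of 0 to 1, the max_distance parameter and the function name icon_position are assumptions; the disclosure only requires that a higher ranking place the icon closer to the target sequence.

```python
def icon_position(segment_start, segment_length, score, max_distance=100.0):
    """Return (position along the sequence, distance away from the sequence).

    The first coordinate aligns the icon with the middle of its segment;
    the second shrinks linearly as the ranking score (0..1) grows.
    """
    clamped = min(max(score, 0.0), 1.0)
    along = segment_start + segment_length / 2
    away = max_distance * (1.0 - clamped)
    return along, away
```

Under this sketch, two icons associated with the same segment share the first coordinate, and the one with the higher score is drawn nearer the sequence, as with icons 404 and 402.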

In one embodiment, a user may retrieve the data represented by the icons 402, 403 and 404, or may be provided with information regarding where the data is located, by selecting the icons 402, 403 or 404 with a cursor, touch, or any other user interface. In one embodiment, different ranking characteristics may change an appearance of the icons 402, 403 and 404. For example, an icon representing data that is referenced often may have a larger shape than data that is referenced seldom. An icon representing data that is available by clicking an icon may have a different outline than data that is not. An icon representing a person may have an image of the person. An icon representing a product, such as an analysis tool or program, may have an icon or image associated with the tool or program, such as a trademark.

In one embodiment, if a segment or adjacent segments of the target biological sequence 401 include a relatively large number of icons, the display 400 a may generate a blob 405. When a user moves a cursor over the blob 405 or performs any other action for selecting the blob 405, the individual icons may be shown and selected.
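
A minimal Python sketch of the blob behavior follows. It simplifies the description by grouping only icons that share the same segment (adjacent segments are not merged), and the threshold of three icons, the tuple layout and the function name group_into_blobs are assumptions for illustration.

```python
from collections import defaultdict


def group_into_blobs(icons, blob_threshold=3):
    """Collapse crowded segments into a single blob.

    icons: iterable of (icon_id, segment_index) pairs.
    Returns a dict mapping segment_index -> ("blob" or "icons", list of icon ids).
    """
    by_segment = defaultdict(list)
    for icon_id, segment_index in icons:
        by_segment[segment_index].append(icon_id)

    display = {}
    for segment_index, ids in by_segment.items():
        kind = "blob" if len(ids) > blob_threshold else "icons"
        # The ids are retained so a selected blob can expand into its icons.
        display[segment_index] = (kind, ids)
    return display
```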

FIG. 4B illustrates a display 400 b of the same target biological sequence 401 as in FIG. 4A, but the ranking preferences are different, corresponding, for example, to preferences selected by a different user. Accordingly, the icons 402, 403 and 404 may be arranged differently and may have different numbers than in FIG. 4A. In embodiments of the present disclosure, a user may modify preferences of the information that the user considers important to personalize information displayed to the user related to a target biological sequence 401.

FIG. 5 illustrates a display 500 according to another embodiment of the present disclosure. In FIG. 5, the target biological sequence 501 is displayed by letters representing, for example, nucleotides, and corresponding probe sequences 502 are represented by complementary letters, or nucleotides that may bond with the nucleotides of the target biological sequence 501. Icons 503, 504, 505 and 506 represent different types of data, such as publications, analysis tools, biographical information, and web page information. The icons 503, 504, 505 and 506 may be located in a horizontal direction (in FIG. 5) based on a segment of the target biological sequence 501 to which the data represented by the icons 503, 504, 505 and 506 is most closely related. The icons 503, 504, 505 and 506 may be separated from the target biological sequence 501 by a distance determined by the ranking of the data or reference associated with the icons 503, 504, 505 and 506.

As illustrated in FIG. 5, the icons 503 to 506 may contain information related to the data that the icon represents. For example, the icons may contain numbers to indicate a number of citations to the data by particular sources. While examples of displays have been provided for purposes of description, embodiments encompass any type of display in which a user may see an importance of data relative to a target biological sequence based on a distance of an icon representing the data from the target biological sequence. In one embodiment, a probe sequence may be further bonded to one or more additional probe sequences. For example, a user may move a cursor over an icon or select the icon, and one or more additional linked icons may appear, corresponding to additional data that is related to the data represented by the selected icon. In one embodiment, the probe sequence may be treated as a target biological sequence, and data may be analyzed and ranked with reference to the probe sequence in the same manner as for the original biological sequence.

FIG. 6 illustrates an example of a table 600 according to an embodiment of the present disclosure. The table 600 may correspond to the reference table 205 of FIG. 2, for example. The table 600 associates a reference, such as a URL, URN, or other address, link or locator, with relevant data and a probe sequence. Examples of relevant data have been discussed previously, and in FIG. 6 the probe sequence corresponds to a biological sequence that is complementary to a segment of a target biological sequence. The table 600 may further include icon information for displaying an icon representing the data or reference, or any other information to be associated with the relevant data and reference. While FIGS. 2 and 6 illustrate tables to associate data, references to the data and probe sequences, embodiments of the present disclosure encompass any data structures for associating data, such as arrays, pointers, or any other types of data structures with which a person or system could associate data with references to the data and with probe sequences.
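
As one illustration of such a data structure, the Python sketch below models a table row and a small in-memory table keyed by reference. The field names, the class names ReferenceEntry and ReferenceTable, and the example URL are hypothetical; the actual layout of table 600 is shown only in FIG. 6.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReferenceEntry:
    """One hypothetical row: a reference, its data, and its probe sequence."""
    reference: str               # e.g. a URL or URN
    relevant_data: str           # description or identifier of the data
    probe_sequence: str          # complementary to a target segment
    icon: Optional[str] = None   # optional icon information for display


class ReferenceTable:
    """Minimal in-memory stand-in for the reference table, keyed by reference."""

    def __init__(self) -> None:
        self._entries: Dict[str, ReferenceEntry] = {}

    def add(self, entry: ReferenceEntry) -> None:
        self._entries[entry.reference] = entry

    def get(self, reference: str) -> Optional[ReferenceEntry]:
        return self._entries.get(reference)

    def by_probe(self, probe_sequence: str) -> List[ReferenceEntry]:
        """All entries whose probe sequence matches exactly."""
        return [e for e in self._entries.values()
                if e.probe_sequence == probe_sequence]


table = ReferenceTable()
table.add(ReferenceEntry("https://example.org/paper-1", "publication",
                         "CCCCTTTTAA", icon="document"))
```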

FIG. 7 illustrates a flow diagram of a method according to an embodiment of the present disclosure. In block 701, a reference, such as an address or pointer to data, may be found by searching memory in a computer, searching storage devices, searching devices connected to a host device connected to a network, such as the Internet, etc. In addition, references to data found in one or more devices may be generated when no previous reference is found, or when a particular type or format of reference is desired.

In block 702, data is associated with the reference. For example, an entry may be formed in a table, or another data structure may be formed, to associate the reference with the data to which the reference points. In block 703, it may be determined whether the data is relevant. In other words, a threshold determination of relevance of the data may be made. The threshold level of relevance may be based on predetermined criteria, such as a similarity of the data to a target biological sequence, a source of the data, such as an organization supplying the data (e.g. university, company, etc.), an author of the data, a publisher of the data, and a type of operation performed by execution of the data (such as in the case of an analysis tool for analyzing biological sequences). The threshold level of relevance may also be based on a frequency with which the data is accessed or referenced, a frequency with which the data is accessed or referenced by predetermined classes, such as researchers, scientists, professional organizations, etc., or a frequency with which the data is associated with a target biological sequence. In other words, the threshold level of relevance may be related to a target sequence or may include criteria unrelated to the target sequence. The threshold level of relevance may be based on the content of the data, such as an identity of a person or organization based on stored biographical information, or content of a document or file. In addition, the threshold level of relevance may be based on usage of the data, such as how often the data is accessed or referenced or by whom the data is accessed or referenced.
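The threshold determination of block 703 might be sketched, purely as an example, as a weighted sum of criterion scores compared against a predetermined threshold. The criterion names, the weights, the 0-to-1 score scale, and the threshold value below are all assumptions made for illustration; the disclosure does not prescribe a particular formula.

```python
from typing import Dict


def relevance_score(item: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of criterion scores; missing criteria count as 0."""
    return sum(weight * item.get(name, 0.0) for name, weight in weights.items())


def is_relevant(item: Dict[str, float], weights: Dict[str, float],
                threshold: float) -> bool:
    """True when the relevance level is greater than the threshold."""
    return relevance_score(item, weights) > threshold


# Hypothetical criteria and values, each scored between 0 and 1.
weights = {"similarity": 0.5, "source_reputation": 0.3, "access_frequency": 0.2}
strong = {"similarity": 0.9, "source_reputation": 0.5, "access_frequency": 0.1}
weak = {"similarity": 0.2}

keep_strong = is_relevant(strong, weights, threshold=0.5)
keep_weak = is_relevant(weak, weights, threshold=0.5)
```

In this sketch, data failing the test would leave the process (block 704), while data passing it would proceed to association with a probe sequence (block 705).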

If it is determined in block 703 that the data does not meet the threshold level of relevance, the process with respect to that data ends in block 704. On the other hand, if the data is determined to be sufficiently relevant, the data and reference may be associated with a probe sequence in block 705. The probe sequence may identify at least a portion of a target biological sequence to which the data pertains. In one embodiment, the probe sequence is represented by a complementary sequence to a corresponding segment of the target biological sequence. For example, if the biological sequence is a series of nucleotides, the probe sequence may be a complementary series of nucleotides. In another embodiment, the probe sequence is merely data identifying the portion of the target biological sequence to which the data most closely pertains. In embodiments of the present disclosure, the data may pertain to one segment of the target biological sequence or to more than one segment.
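As one possible illustration of the complementary-sequence embodiment, a probe sequence for a DNA segment could be derived by base-wise complementation. The function name `probe_for_segment`, the restriction to the DNA alphabet A, C, G, T, and the zero-based half-open indexing are assumptions for this sketch; whether a probe would also be reversed, as in a reverse complement, is not specified in the disclosure and is not done here.

```python
# Base-wise DNA complement table: A<->T, C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def probe_for_segment(target: str, start: int, end: int) -> str:
    """Return the base-wise complement of target[start:end].

    Indices are zero-based and half-open, following Python slicing.
    """
    segment = target[start:end].upper()
    unknown = set(segment) - set("ACGT")
    if unknown:
        raise ValueError(f"unsupported symbols in segment: {sorted(unknown)}")
    return segment.translate(_COMPLEMENT)


# Hypothetical target sequence; the segment at positions 2..5 is "GCCG".
example_probe = probe_for_segment("ATGCCGTA", 2, 6)
```

Here `example_probe` is the complement of the segment "GCCG".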

In block 706, the data is ranked based on predetermined criteria to determine an affinity or bond between the data, reference, or probe sequence and a segment of the target biological sequence. The ranking may be based on one or more criteria, and the criteria may be weighted so that some criteria affect the ranking more than others. Examples of ranking criteria include a similarity of the data to the target biological sequence, a relevance of the content of the data to the target biological sequence, an importance or prestige of the data or reference, a popularity of the data or reference, and a historical applicability of the data or reference to target biological sequences.
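The weighted ranking of block 706 could be sketched, for example, as sorting probe sequences by a weighted sum of their criterion scores. The probe identifiers, criterion names, score values, and weights below are hypothetical, and a weighted sum is only one way the weighting might be realized.

```python
from typing import Dict, List


def rank_probes(probes: List[dict], weights: Dict[str, float]) -> List[str]:
    """Return probe ids ordered from highest to lowest weighted score."""
    def score(probe: dict) -> float:
        criteria = probe["criteria"]
        return sum(w * criteria.get(name, 0.0) for name, w in weights.items())

    return [p["id"] for p in sorted(probes, key=score, reverse=True)]


# Hypothetical probes with criterion scores between 0 and 1.
probes = [
    {"id": "A", "criteria": {"similarity": 0.9, "prestige": 0.5,
                             "popularity": 0.5, "historical": 0.5}},
    {"id": "B", "criteria": {"similarity": 0.4, "prestige": 0.9,
                             "popularity": 0.9, "historical": 0.9}},
    {"id": "C", "criteria": {"similarity": 0.2, "prestige": 0.1,
                             "popularity": 0.1, "historical": 0.1}},
]

balanced = {"similarity": 0.4, "prestige": 0.2, "popularity": 0.2, "historical": 0.2}
similarity_first = {"similarity": 0.7, "prestige": 0.1, "popularity": 0.1,
                    "historical": 0.1}

order_balanced = rank_probes(probes, balanced)
order_similarity = rank_probes(probes, similarity_first)
```

With the balanced weights, probe B outranks probe A; raising the weight on similarity reverses their order, which shows how a change in weighting alone can change the resulting affinity.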

In one embodiment, a user may add criteria, remove criteria, and adjust a weight of criteria used to rank relevant data, references to the data and probe sequences associated with the data. In embodiments of the present disclosure, different users may generate different profiles or may otherwise indicate different preferences for ranking information related to a target biological sequence. In block 707, the relevant data may be bound to the target biological sequence, or segments of the target biological sequence, based on the ranking. In particular, the target biological sequence may be displayed and icons may be displayed with the target biological sequence representing the data, references and probe sequences.

A graphical display may display an icon or other representation of the data, reference, or probe and of the target biological sequence, or one or more target segments of the biological sequence. The ranking of the data, references or probe sequences may be displayed by displaying icons associated with data having a higher ranking as being located closer to a corresponding segment of the target biological sequence, while icons representing data having a lower ranking are located farther from the segment of the target biological sequence.
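The mapping from ranking to display distance might, as a minimal sketch, assign each icon a distance that grows with its position in the ranking. The function name `icon_distances`, the linear spacing, and the default pixel values are assumptions chosen for illustration; any monotonic mapping would be consistent with the description above.

```python
from typing import Dict, List


def icon_distances(ranked_ids: List[str], min_distance: float = 20.0,
                   step: float = 15.0) -> Dict[str, float]:
    """Map each id to a display distance from the target segment.

    ranked_ids must be ordered from highest to lowest ranking, so the
    first id is placed closest to the segment.
    """
    return {pid: min_distance + index * step
            for index, pid in enumerate(ranked_ids)}


# Hypothetical ranking, highest first.
distances = icon_distances(["B", "A", "C"])
```

A rendering layer could then draw each icon at its computed distance from the graphical representation of the segment.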

FIG. 8 illustrates a block diagram of a computer system 800 according to another embodiment of the present disclosure. The computer system 800 may correspond to the host computer 110 of FIG. 1, for example. The methods described herein can be implemented in hardware, software (e.g., firmware), or a combination thereof. In an exemplary embodiment, the methods described herein are implemented in hardware as part of the microprocessor of a special or general-purpose digital computer, such as a personal computer, workstation, minicomputer, or mainframe computer. The system 800 therefore may include a general-purpose computer or mainframe 801 capable of testing a reliability of a base program by gradually increasing a workload of the base program over time.

In an exemplary embodiment, in terms of hardware architecture, as shown in FIG. 8, the computer 801 includes one or more processors 805, memory 810 coupled to a memory controller 815, and one or more input and/or output (I/O) devices 840, 845 (or peripherals) that are communicatively coupled via a local input/output controller 835. The input/output controller 835 can be, for example, one or more buses or other wired or wireless connections, as is known in the art. The input/output controller 835 may have additional elements, which are omitted for simplicity in description, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components. The input/output controller 835 may include a plurality of sub-channels configured to access the output devices 840 and 845. The sub-channels may include, for example, fiber-optic communications ports.

The processor 805 is a hardware device for executing software, particularly that stored in storage 820, such as cache storage, or memory 810. The processor 805 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer 801, a semiconductor based microprocessor (in the form of a microchip or chip set), a macroprocessor, or generally any device for executing instructions.

The memory 810 can include any one or combination of volatile memory elements (e.g., random access memory (RAM, such as DRAM, SRAM, SDRAM, etc.)) and nonvolatile memory elements (e.g., ROM, erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, diskette, cartridge, cassette or the like, etc.). Moreover, the memory 810 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 810 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 805.

The instructions in memory 810 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 8, the instructions in the memory 810 include a suitable operating system (O/S) 811. The operating system 811 essentially controls the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.

In an exemplary embodiment, a conventional keyboard 850 and mouse 855 can be coupled to the input/output controller 835. Other I/O devices 840, 845 may include input and output devices, for example but not limited to a printer, a scanner, a microphone, and the like. Finally, the I/O devices 840, 845 may further include devices that communicate both inputs and outputs, for instance but not limited to, a network interface card (NIC) or modulator/demodulator (for accessing other files, devices, systems, or a network), a radio frequency (RF) or other transceiver, a telephonic interface, a bridge, a router, and the like. The system 800 can further include a display controller 825 coupled to a display 830. In an exemplary embodiment, the system 800 can further include a network interface 860 for coupling to a network 865. The network 865 can be an IP-based network for communication between the computer 801 and any external server, client and the like via a broadband connection. The network 865 transmits and receives data between the computer 801 and external systems. In an exemplary embodiment, network 865 can be a managed IP network administered by a service provider. The network 865 may be implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as WiFi, WiMax, etc. The network 865 can also be a packet-switched network such as a local area network, wide area network, metropolitan area network, Internet network, or other similar type of network environment. The network 865 may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), intranet or other suitable network system and includes equipment for receiving and transmitting signals.

When the computer 801 is in operation, the processor 805 is configured to execute instructions stored within the memory 810, to communicate data to and from the memory 810, and to generally control operations of the computer 801 pursuant to the instructions.

In an exemplary embodiment, the methods described herein can be implemented with any or a combination of the following technologies, which are each well known in the art: a discrete logic circuit(s) having logic gates for implementing logic functions upon data signals, an application specific integrated circuit (ASIC) having appropriate combinational logic gates, a programmable gate array(s) (PGA), a field programmable gate array (FPGA), etc.

In embodiments of the present disclosure, the simulated annealing module 111 may comprise program code stored in the memory 810 and executed by the processor 805. The data and references pointing to the data may be stored in the computer 801 or may be stored on other computers, servers, databases, or other network devices connected to the computer 801 via a network. The simulated annealing module 111 may further include hardware components, such as processors, memory and logic chips or structures for implementing the simulated annealing.

As described above, embodiments can be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. An embodiment may include a computer program product 900 as depicted in FIG. 9 on a computer readable/usable medium 902 with computer program code logic 904 containing instructions embodied in tangible media as an article of manufacture. Exemplary articles of manufacture for computer readable/usable medium 902 may include floppy diskettes, CD-ROMs, hard drives, universal serial bus (USB) flash drives, or any other computer-readable storage medium, wherein, when the computer program code logic 904 is loaded into and executed by a computer, the computer becomes an apparatus for practicing the embodiments. Embodiments include computer program code logic 904, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code logic 904 is loaded into and executed by a computer, the computer becomes an apparatus for practicing the embodiments. When implemented on a general-purpose microprocessor, the computer program code logic 904 segments configure the microprocessor to create specific logic circuits.

As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the present disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention to the particular embodiments described. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the embodiments of the present disclosure.

While embodiments of the present disclosure have been described above, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements which fall within the scope of the claims which follow.

1. A computer assembly for associating data with a target biological sequence, comprising: a processor configured to access data on a network and to perform a method, the method comprising: identifying, in the network, one or more references having a relevance level greater than a predetermined threshold, said references being at least one of a pointer and an address indicating a location of data or providing information regarding the data; associating each reference of the one or more references to one or more probe sequences corresponding to one or more biological sequences; ranking the one or more probe sequences based on one or more criteria corresponding to a target biological sequence; and assigning the one or more probe sequences with a level of affinity to one or more segments of the target biological sequence based at least on the ranking of each of the one or more probe sequences.
2. The computer assembly of claim 1, wherein the references are uniform resource locators (URLs).
3. The computer assembly of claim 1, wherein associating the one or more probe sequences to the one or more references having a relevance level greater than a predetermined threshold includes analyzing the one or more references to detect the presence of one or more of key words, phrases, symbols and sources of the one or more references.
4. The computer assembly of claim 1, wherein associating each reference of the one or more references to one or more probe sequences includes associating each reference with a biological sequence that is complementary to a biological sequence referenced by the reference.
5. The computer assembly of claim 1, wherein ranking the one or more probe sequences comprises determining a similarity between the one or more probe sequences and the one or more segments of the target biological sequence, and ranking the one or more probe sequences further comprises at least one of determining an importance of a source of each reference of each probe sequence, determining a popularity of each reference of each probe sequence, and determining a historical applicability of each reference of each probe sequence to the one or more segments of the target biological sequence.
6. The computer assembly of claim 5, wherein determining the similarity between the one or more probe sequences and the one or more segments of the target biological sequence includes determining a match between the one or more probe sequences and a complement of the one or more segments of the target biological sequence.
7. The computer assembly of claim 6, wherein the method further comprises associating the one or more probe sequences with at least one of a document, an analysis tool, and biographical information of a person.
8. The computer assembly of claim 7, wherein determining an importance of a source of each reference includes at least one of determining a number of citations of an author of the document, determining an organization to which an author of the document belongs, and determining a type of analysis performed by the analysis tool, determining a popularity of each reference includes at least one of determining a number of citations of the document and a frequency of use of the analysis tool, and determining a historical applicability of each reference includes at least one of determining a frequency with which the document has been cited in association with the segment of the target biological sequence and determining a frequency with which the analysis tool has been used to analyze the segment.
9. The computer assembly of claim 1, wherein the one or more probe sequences includes at least two probe sequences corresponding to a same segment of the target biological sequence, and assigning the at least two probe sequences a level of affinity to the segment of the target biological sequence includes competitively comparing the at least two probe sequences such that a reference having a higher ranking is assigned a higher level of affinity than a reference having a lower ranking.
10. The computer assembly of claim 1, further comprising a display, wherein the method further comprises displaying a graphical representation of the segment of the target biological sequence on the display, and displaying a graphical representation of the level of affinity of each probe sequence to the segment of the target biological sequence on the display by adjusting a physical distance of a graphical representation of the probe sequence from the graphical representation of the segment based on the level of affinity of the probe sequence.
11. A system for simulating annealing to a biological sequence, comprising: one or more network computers having stored therein data; and a host computer having stored therein a biological sequence, the host computer connected to the one or more network computers via a communications network, the host computer configured to identify data in the one or more network computers as relevant data that is relevant to the biological sequence, to perform one of identifying references to the data in the one or more network computers and generating references to the data in the one or more network computers, said references being at least one of a pointer and an address indicating a location of the data, associate the relevant data with a segment of the biological sequence, and rank the relevant data based on predetermined criteria applied to functions of the associated segment of the biological sequence to determine a level of affinity of the relevant data with the segment of the biological sequence.
12. The system of claim 11, wherein the host computer is configured to search the network for uniform resource locators (URLs) pointing to the data stored in the one or more network computers, to associate the URLs with the data.
13. The system of claim 11, wherein the host computer is configured to associate the relevant data with one or more probe sequences, the one or more probe sequences corresponding to one or more respective segments of the biological sequence.
14. The system of claim 13, wherein the host computer is configured to competitively rank the one or more probe sequences corresponding to a same segment of the biological sequence, such that a probe sequence having a higher ranking has a higher level of affinity to the segment of the biological sequence than a probe sequence having a lower ranking.
15. The system of claim 13, wherein the host computer is configured to rank the one or more probe sequences based on a correspondence between the one or more probe sequences and the segment of the biological sequence, and the host computer is configured to further rank the one or more probe sequences based on at least one of an importance of a source of data associated with the one or more probe sequences, a popularity of the data associated with the one or more probe sequences, and a historical applicability of the data associated with the one or more probe sequences to the segment of the biological sequence.
16. The system of claim 15, wherein the data includes an analysis tool relevant to the segment of the biological sequence, the importance of the source of data associated with the one or more probe sequences is based on at least one of a source of the analysis tool and a type of analysis performed by the analysis tool, a popularity of data associated with the one or more probe sequences is based on a frequency of use of the analysis tool, and a historical applicability of data associated with the one or more probe sequences is based on a frequency with which an analysis tool has been used to analyze the segment of the biological sequence.
17. The system of claim 15, wherein the data includes a document relevant to the segment of the biological sequence, the importance of the source of data associated with the one or more probe sequences is based on at least one of a number of citations of an author of the document and an organization with which the author of the document is associated, a popularity of data associated with the one or more probe sequences is based on at least one of a number of citations to the document, and a historical applicability of data associated with the one or more probe sequences is based on a frequency with which the document has been cited in association with the segment of the biological sequence.
18. The system of claim 11, further comprising a display, wherein the host computer is configured to display a graphical representation of the segment of the biological sequence on the display and a graphical representation of the level of affinity of relevant data to the segment of the biological sequence on the display by adjusting a physical distance of a graphical representation of the relevant data from the graphical representation of the segment based on the level of affinity of the reference.
19. The system of claim 11, wherein the host computer is further configured to associate the relevant data with a first probe sequence, simulate annealing of the first probe sequence with the segment of the biological sequence based on the determined level of affinity of the relevant data with the segment of the biological sequence, and simulate annealing of a second probe sequence with the first probe sequence based on a determined level of affinity of the second probe sequence with the first probe sequence.
20. A computer program product for simulating annealing to a biological sequence, comprising: a processor; and a non-transitory computer readable medium having stored thereon code to perform a method, comprising: identifying, by the processor, references to data in a network as relevant references that are relevant to a biological sequence, said references being at least one of a pointer and an address indicating a location of data or providing information regarding the data; associating, by the processor, the relevant references with a segment of the biological sequence; and ranking, by the processor, the relevant references based on predetermined criteria to determine a level of affinity of the relevant references with the segment of the biological sequence.
21. The computer program product of claim 20, wherein the method comprises associating the relevant references with one or more probe sequences, the one or more probe sequences corresponding to one or more respective segments of the biological sequence; and the host computer is configured to competitively rank the one or more probe sequences corresponding to a same segment of the biological sequence, such that a probe sequence having a higher ranking has a higher level of affinity to the segment of the biological sequence than a probe sequence having a lower ranking.
22. The computer program product of claim 20, wherein ranking the relevant references includes ranking the one or more probe sequences based on a correspondence between the one or more probe sequences and the segment of the biological sequence, and ranking the one or more probe sequences further includes ranking the one or more probe sequences based on at least one of an importance of a source of data associated with the one or more probe sequences, a popularity of the data associated with the one or more probe sequences, and a historical applicability of the data associated with the one or more probe sequences to the segment of the biological sequence.
23. The computer program product of claim 20 wherein the data includes at least one of a document, an analysis tool, and biographical information of a person, determining an importance of a source of each reference includes at least one of determining a number of citations of an author of the document, determining an organization to which an author of the document belongs, and determining a type of analysis performed by the analysis tool, determining a popularity of each reference includes at least one of determining a number of citations of the document and a frequency of use of the analysis tool, and determining a historical applicability of each reference to the segment includes at least one of determining a frequency with which the document has been cited in association with the segment of the biological sequence and determining a frequency with which the analysis tool has been used to analyze the segment.
24. The computer program product of claim 20, wherein the references correspond to a same segment of the biological sequence, the method further comprising: determining a level of affinity of the relevant references with the segment of the biological sequence includes competitively comparing the relevant references such that a reference having a higher ranking is assigned a higher level of affinity than a reference having a lower ranking.
25. The computer program product of claim 20, the method further comprising: displaying a graphical representation of the segment of the biological sequence, and displaying a graphical representation of the level of affinity of each reference to the segment of the biological sequence on the display by adjusting a physical distance of a graphical representation of the reference from the graphical representation of the segment based on the level of affinity of the reference.